PARALLEL COMPUTER ARCHITECTURES


8 PARALLEL COMPUTER ARCHITECTURES

Figure 8-1. (a) A multiprocessor with 16 CPUs sharing a common memory. (b) An image partitioned into 16 sections, each being analyzed by a different CPU.

Figure 8-2. (a) A multicomputer with 16 CPUs, each with its own private memory, joined by a message-passing interconnection network. (b) The bit-map image of Fig. 8-1 split up among the 16 memories.

Figure 8-3. Various layers where shared memory can be implemented. (a) The hardware. (b) The operating system. (c) The language runtime system.

Figure 8-4. Various topologies. The heavy dots represent switches. The CPUs and memories are not shown. (a) A star. (b) A complete interconnect. (c) A tree. (d) A ring. (e) A grid. (f) A double torus. (g) A cube. (h) A 4D hypercube.

Figure 8-5. An interconnection network in the form of a four-switch square grid. Only two of the CPUs are shown.

Figure 8-6. Store-and-forward packet switching.

Figure 8-7. Deadlock in a circuit-switched interconnection network.

Figure 8-8. Real programs (here an N-body problem, Awari, and a skyline matrix inversion, on up to 64 CPUs) achieve less than the perfect linear speedup indicated by the dotted line.

Figure 8-9. (a) A program has an inherently sequential part and a potentially parallelizable part. (b) Effect of running part of the program in parallel: with the sequential fraction f run on 1 CPU and the rest spread over n CPUs, the run time drops from T to fT + (1 - f)T/n.
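Figure 8-9 is Amdahl's law in pictorial form: dividing the original time T by the parallel time fT + (1 - f)T/n gives a speedup of 1/(f + (1 - f)/n). A minimal sketch (the function name is mine):

```python
def amdahl_speedup(f, n):
    """Speedup of a program whose fraction f is inherently sequential,
    run on n CPUs: T_parallel = f*T + (1 - f)*T/n."""
    return 1.0 / (f + (1.0 - f) / n)

# Even a small sequential fraction caps the achievable speedup:
# with f = 0.05, no number of CPUs can exceed a speedup of 1/f = 20.
print(round(amdahl_speedup(0.05, 64), 2))     # -> 15.42
print(round(amdahl_speedup(0.05, 10**9), 2))  # -> 20.0
```

This is why the curves of Figure 8-8 bend away from the dotted line as CPUs are added.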

Figure 8-10. (a) A 4-CPU bus-based system. (b) A 16-CPU bus-based system. (c) A 4-CPU grid-based system. (d) A 16-CPU grid-based system.

Figure 8-11. Computational paradigms. (a) Pipeline. (b) Phased computation with synchronization points. (c) Divide and conquer. (d) Replicated workers with a work queue.

Physical        Logical            Examples
Multiprocessor  Shared variables   Image processing as in Fig. 8-1
Multiprocessor  Message passing    Message passing simulated with buffers in memory
Multicomputer   Shared variables   DSM, Linda, Orca, etc. on an SP/2 or a PC network
Multicomputer   Message passing    PVM or MPI on an SP/2 or a network of PCs

Figure 8-12. Combinations of physical and logical sharing.

Instruction streams  Data streams  Name  Examples
1                    1             SISD  Classical Von Neumann machine
1                    Multiple      SIMD  Vector supercomputer, array processor
Multiple             1             MISD  Arguably none
Multiple             Multiple      MIMD  Multiprocessor, multicomputer

Figure 8-13. Flynn's taxonomy of parallel computers.

Figure 8-14. A taxonomy of parallel computers. SISD is the classical Von Neumann machine; SIMD covers vector processors and array processors; MISD is arguably empty; MIMD splits into shared-memory multiprocessors (UMA, with bus or switched interconnect; COMA; and NUMA, either CC-NUMA or NC-NUMA) and message-passing multicomputers (MPPs, with grid or hypercube interconnect, and COWs).

Figure 8-15. A vector ALU.

Operation              Examples
Ai = f1(Bi)            f1 = cosine, square root
Scalar = f2(A)         f2 = sum, minimum
Ai = f3(Bi, Ci)        f3 = add, subtract
Ai = f4(scalar, Bi)    f4 = multiply Bi by a constant

Figure 8-16. Various combinations of vector and scalar operations.

Step  Name                 Values
1     Fetch operands       1.082 x 10^12   9.212 x 10^11
2     Adjust exponent      1.082 x 10^12   0.9212 x 10^12
3     Execute subtraction  0.1608 x 10^12
4     Normalize result     1.608 x 10^11

Figure 8-17. Steps in a floating-point subtraction.

Cycle:              1       2       3       4       5       6       7
Fetch operands      B1,C1   B2,C2   B3,C3   B4,C4   B5,C5   B6,C6   B7,C7
Adjust exponent             B1,C1   B2,C2   B3,C3   B4,C4   B5,C5   B6,C6
Execute operation                   B1+C1   B2+C2   B3+C3   B4+C4   B5+C5
Normalize result                            B1+C1   B2+C2   B3+C3   B4+C4

Figure 8-18. A pipelined floating-point adder.
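The schedule of Figure 8-18 can be generated mechanically: operand pair i occupies stage s on cycle i + s, so once the pipeline fills, one result emerges every cycle. A small sketch (names are my own choosing):

```python
STAGES = ["Fetch operands", "Adjust exponent",
          "Execute operation", "Normalize result"]

def pipeline_schedule(n_pairs, n_stages=len(STAGES)):
    """Return, per cycle, which operand pair (1-based) sits in each stage;
    None marks an idle stage. n_pairs additions need n_pairs + n_stages - 1
    cycles in total."""
    total_cycles = n_pairs + n_stages - 1
    schedule = []
    for cycle in range(1, total_cycles + 1):
        row = []
        for stage in range(n_stages):
            pair = cycle - stage  # pair entering stage 0 on cycle `pair`
            row.append(pair if 1 <= pair <= n_pairs else None)
        schedule.append(row)
    return schedule

sched = pipeline_schedule(7)
print(len(sched))   # -> 10 cycles for 7 additions through 4 stages
print(sched[3])     # cycle 4, pipeline full: -> [4, 3, 2, 1]
```

The unpipelined alternative would need 4 cycles per addition, 28 in all.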

Figure 8-19. Registers and functional units of the Cray-1: 8 24-bit address registers (A), 64 24-bit holding registers for addresses (B), 8 64-bit scalar registers (S), 64 64-bit holding registers for scalars (T), and 8 64-bit vector registers of 64 elements each. The functional units are the address units (ADD, MUL), the scalar integer units (ADD, BOOLEAN, SHIFT, POP. COUNT), the scalar/vector floating-point units (ADD, MUL, RECIP.), and the vector integer units (ADD, BOOLEAN, SHIFT).

Figure 8-20. (a) Two CPUs writing and two CPUs reading a common memory word. (b) - (d) Three possible ways the two writes and four reads might be interleaved in time.

Figure 8-21. Weakly consistent memory uses synchronization operations to divide time into sequential epochs.

Figure 8-22. Three bus-based multiprocessors. (a) Without caching. (b) With caching. (c) With caching and private memories.

Action      Local request              Remote request
Read miss   Fetch data from memory
Read hit    Use data from local cache
Write miss  Update data in memory
Write hit   Update cache and memory    Invalidate cache entry

Figure 8-23. The write-through cache coherence protocol. The empty entries indicate that no action is taken.
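The table's actions are easy to mimic in a toy model: every write goes through to memory, and snooping peers invalidate their copy of a remotely written word. A sketch assuming a single shared bus (all class and method names here are mine, not from the text):

```python
class WriteThroughCache:
    """Toy write-through snooping cache: maps word address -> value,
    backed by a shared memory dict; peers are the other caches on the bus."""
    def __init__(self, memory, peers):
        self.lines = {}
        self.memory = memory
        self.peers = peers

    def read(self, addr):
        if addr not in self.lines:         # read miss: fetch from memory
            self.lines[addr] = self.memory[addr]
        return self.lines[addr]            # read hit: use local copy

    def write(self, addr, value):
        if addr in self.lines:             # write hit: update cache too
            self.lines[addr] = value
        self.memory[addr] = value          # always update memory
        for peer in self.peers:            # remote caches see the write
            peer.lines.pop(addr, None)     # ...and invalidate their entry

memory = {100: 7}
a = WriteThroughCache(memory, [])
b = WriteThroughCache(memory, [a])
a.peers.append(b)

a.read(100)          # a caches the old value 7
b.write(100, 42)     # b writes through; a's stale copy is invalidated
print(a.read(100))   # a misses and refetches: 42
```

The key invariant is that memory always holds the current value, so a miss can always be satisfied from memory.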

Figure 8-24. The MESI cache coherence protocol: (a) CPU 1 reads block A (exclusive). (b) CPU 2 reads block A (both copies shared). (c) CPU 2 writes block A (modified). (d) CPU 3 reads block A (shared). (e) CPU 2 writes block A (modified). (f) CPU 1 writes block A (modified).

Figure 8-25. (a) An 8 x 8 crossbar switch. (b) An open crosspoint. (c) A closed crosspoint.

Figure 8-26. The Sun Enterprise 10000 symmetric multiprocessor: a 16 x 16 crossbar switch (the Gigaplane-XB) connects boards, each holding 4 UltraSPARC CPUs and 4 GB of memory (four 1-GB modules). The transfer unit is a 64-byte cache block, and four address buses are used for snooping.

Figure 8-27. (a) A 2 x 2 switch with input ports A, B and output ports X, Y. (b) A message format with Module, Address, Opcode, and Value fields.

Figure 8-28. An omega switching network connecting eight CPUs to eight memories through three stages of 2 x 2 switches.
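Routing in an omega network is self-directing: at stage k each 2 x 2 switch examines bit k of the destination address, most significant bit first, and takes the upper output for a 0 bit and the lower output for a 1. A sketch (the function name is mine):

```python
def omega_route(dst, n_bits=3):
    """Output port ('upper' or 'lower') chosen at each of the log2(N)
    stages when routing to memory module dst in an N-input omega network.
    Stage k inspects bit k of dst, most significant bit first."""
    bits = format(dst, "0{}b".format(n_bits))
    return ["upper" if b == "0" else "lower" for b in bits]

# Routing any request to memory 110 in the 8-way network of Figure 8-28:
print(omega_route(0b110))   # -> ['lower', 'lower', 'upper']
```

Because the routing decision depends only on the destination, no switch needs global knowledge of the network, but two requests whose paths need the same output port in the same stage will conflict.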

Figure 8-29. A NUMA machine based on two levels of buses. The Cm* was the first multiprocessor to use this design.

Figure 8-30. (a) A 256-node directory-based multiprocessor. (b) Division of a 32-bit memory address into fields: an 8-bit node number, an 18-bit block number, and a 6-bit offset within the block. (c) The directory at node 36.
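Extracting the fields of Figure 8-30(b) is a pair of shifts and masks: the low 6 bits select a byte within the 64-byte block, the next 18 bits select one of the 2^18 blocks at a node, and the top 8 bits name the home node. A sketch (the function name is mine):

```python
def split_address(addr):
    """Split a 32-bit address as in Figure 8-30(b):
    top 8 bits = home node, next 18 bits = block number,
    low 6 bits = offset within the 64-byte block."""
    offset = addr & 0x3F            # low 6 bits
    block = (addr >> 6) & 0x3FFFF   # next 18 bits
    node = (addr >> 24) & 0xFF      # top 8 bits
    return node, block, offset

# An address homed at node 36, block 5, byte 12 of the block:
addr = (36 << 24) | (5 << 6) | 12
print(split_address(addr))   # -> (36, 5, 12)
```

On a miss, the hardware sends the request straight to the home node given by the top field, which is why directory lookups need no searching.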

Figure 8-31. (a) The DASH architecture: 16 clusters, each with CPUs (with caches), memory, a directory, and an intercluster interface; the local buses snoop, the intercluster bus does not. (b) A DASH directory, here for cluster 13: each memory block homed in the cluster has one presence bit per cluster (telling, for example, whether cluster 0 has block 1 in any of its caches) plus a state: uncached, shared, or modified.

Figure 8-32. The NUMA-Q multiprocessor: quad boards, each with 4 Pentium Pros and up to 4 GB of RAM, attach through a snooping bus interface to an IQ board (directory controller, directory, 32-MB cache RAM, and data pump) on an SCI ring.

Figure 8-33. SCI chains all the holders of a given cache line together in a doubly-linked list; the local memory table at the home node points to the head of the chain, and each node's cache-directory entry carries Back, State, Tag, and Fwd fields. In this example, a line is shown cached at three nodes (4, 9, and 22).

Figure 8-34. A generic multicomputer: each node has a CPU, memory, disk and I/O, and a communication processor on a local interconnect, and the nodes are joined by a high-performance interconnection network.

Figure 8-35. The Cray Research T3E: each node couples an Alpha CPU and memory with a shell (control logic plus E-registers) and a communication processor. The nodes form a full-duplex 3D torus, with a GigaRing channel for network, disk, and tape I/O.

Figure 8-36. The Intel/Sandia Option Red system. (a) The Kestrel board, with two Pentium Pro pairs, each pair sharing 64 MB of RAM, I/O, and a NIC on a 64-bit local bus. (b) The interconnection network.

Figure 8-37. Scheduling a COW with three groups of eight CPUs. (a) FIFO. (b) Without head-of-line blocking. (c) Tiling. The shaded areas indicate idle CPUs.

Figure 8-38. (a) Three computers on an Ethernet. (b) An Ethernet switch.

Figure 8-39. Sixteen CPUs connected by four ATM switches. Two virtual circuits are shown.

Figure 8-40. A globally shared virtual address space consisting of 16 pages spread over four nodes of a multicomputer. (a) The initial situation. (b) After CPU 0 references page 10. (c) After CPU 1 references page 10, here assumed to be a read-only page.
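The page movements of Figure 8-40 can be mimicked with sets of page numbers: a reference to a remote page migrates it to the referencing node, except that a read-only page is simply replicated. A sketch (the class and function names and the exact initial page placement are illustrative, not taken from the figure):

```python
class DsmNode:
    """Toy node of a page-based DSM system: just a name and the set of
    (virtual) page numbers currently resident in its memory."""
    def __init__(self, name):
        self.name = name
        self.pages = set()

def reference(nodes, cpu, page, read_only=False):
    """CPU `cpu` touches `page`: hit if local, otherwise migrate the page
    from its current holder, or replicate it if it is read-only."""
    if page in nodes[cpu].pages:
        return                           # local hit, nothing to do
    owner = next(n for n in nodes if page in n.pages)
    if read_only:
        nodes[cpu].pages.add(page)       # replicate the read-only page
    else:
        owner.pages.remove(page)         # migrate: remove from old holder
        nodes[cpu].pages.add(page)

nodes = [DsmNode("CPU {}".format(i)) for i in range(4)]
nodes[0].pages = {0, 2, 5, 9}
nodes[1].pages = {1, 3, 6, 8}
nodes[2].pages = {4, 7, 10, 12}
nodes[3].pages = {11, 13, 14, 15}

reference(nodes, 0, 10)                  # like (b): page 10 migrates to CPU 0
reference(nodes, 1, 10, read_only=True)  # like (c): CPU 1 gets a copy
# nodes[0] now holds page 10, nodes[2] no longer does, nodes[1] has a copy
print(sorted(nodes[0].pages))
```

A real DSM must also handle writes to replicated pages (invalidating the copies), which this sketch omits.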

("abc", 2, 5)
("matrix-1", 1, 6, 3.14)
("family", "is sister", Carolyn, Elinor)

Figure 8-41. Three Linda tuples.
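Linda programs manipulate such tuples through a shared tuple space: out inserts a tuple, in removes a tuple matching a template, and rd reads one without removing it. A toy, non-blocking tuple space with None as the wildcard field (real Linda's in and rd suspend until a match appears, and match on field types; all names here are mine):

```python
class TupleSpace:
    """Minimal Linda-style tuple space: out() inserts, in_() removes a
    matching tuple, rd() reads without removing. A template field of
    None matches anything; no operation blocks in this toy version."""
    def __init__(self):
        self.tuples = []

    def out(self, t):
        self.tuples.append(t)

    def _match(self, template):
        for t in self.tuples:
            if len(t) == len(template) and all(
                    p is None or p == f for p, f in zip(template, t)):
                return t
        return None

    def rd(self, *template):
        return self._match(template)

    def in_(self, *template):
        t = self._match(template)
        if t is not None:
            self.tuples.remove(t)
        return t

ts = TupleSpace()
ts.out(("abc", 2, 5))
ts.out(("matrix-1", 1, 6, 3.14))
print(ts.rd("abc", None, None))   # -> ('abc', 2, 5), still in the space
print(ts.in_("abc", None, None))  # -> ('abc', 2, 5), now removed
print(ts.rd("abc", None, None))   # -> None
```

On a multicomputer, the tuple space itself must be distributed or replicated across the nodes, which is where the implementation effort goes.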

Object implementation stack;
    top: integer;                                 # stack pointer
    stack: array [integer 0..N-1] of integer;     # storage for the stack

    operation push(item: integer);                # function returning nothing
    begin
        stack[top] := item;                       # push item onto the stack
        top := top + 1;                           # increment the stack pointer
    end;

    operation pop(): integer;                     # function returning an integer
    begin
        guard top > 0 do                          # suspend if the stack is empty
            top := top - 1;                       # decrement the stack pointer
            return stack[top];                    # return the top item
        od;
    end;
begin
    top := 0;                                     # initialization
end;

Figure 8-42. A simplified Orca stack object, with internal data and two operations.