Three basic multiprocessing issues


1. Partitioning. The sequential program must be partitioned into subprogram units or tasks. This is done either by the programmer or by the compiler.

2. Scheduling. Associated with the tasks in a program are requirements on when the tasks must execute. Some tasks may be executed at any time; some may not be initiated before certain other tasks have completed. At run time, the scheduler attempts to arrange the tasks in such a way as to minimize the overall program execution time.

3. Synchronization and communication. In order for the program to be correct, different tasks have to be synchronized with each other, so the architecture has to provide primitives for synchronization. In order to be efficient, the tasks must communicate rapidly. An important communication issue is memory coherence: if separate copies of the same data are held in different memories, a change to one must be immediately reflected in the others.

Exactly how efficient the communication must be depends on the grain size of the partitioning.

Grain size      Program construct     Typical # instrs.
Fine grain      Basic block           5-10
                Procedure             20-100
Medium grain    Small task            100-1,000
                Intermediate task     1,000-100,000
Coarse grain    Large task            > 100,000
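To make the three issues concrete, here is a minimal sketch in C using POSIX threads. It is not from the original notes; the array-summing task, the slice-per-task partitioning, and the barrier are illustrative assumptions. The program is partitioned into tasks, the tasks communicate through shared memory, and a barrier provides the synchronization that makes the communication correct.

    #include <pthread.h>
    #include <stdio.h>

    #define NTASKS 4
    #define N 1024

    static double data[N];             /* shared data, partitioned among tasks */
    static double partial[NTASKS];     /* one partial result per task */
    static pthread_barrier_t barrier;  /* tasks synchronize here */

    /* Each task sums its contiguous slice of the array (the partitioning),
       then waits at a barrier (the synchronization) before task 0 combines
       the partial results (communication through shared memory). */
    static void *task(void *arg)
    {
        long id = (long) arg;
        long lo = id * (N / NTASKS), hi = lo + (N / NTASKS);
        double sum = 0.0;
        for (long i = lo; i < hi; i++)
            sum += data[i];
        partial[id] = sum;

        pthread_barrier_wait(&barrier);  /* no task proceeds until all are done */

        if (id == 0) {
            double total = 0.0;
            for (int t = 0; t < NTASKS; t++)
                total += partial[t];
            printf("total = %f\n", total);
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t threads[NTASKS];
        for (int i = 0; i < N; i++)
            data[i] = 1.0;
        pthread_barrier_init(&barrier, NULL, NTASKS);
        for (long t = 0; t < NTASKS; t++)
            pthread_create(&threads[t], NULL, task, (void *) t);
        for (int t = 0; t < NTASKS; t++)
            pthread_join(threads[t], NULL);
        pthread_barrier_destroy(&barrier);
        return 0;
    }

Note that each task here executes only a few hundred instructions, so the cost of creating, scheduling, and joining the threads is already significant, which previews the grain-size trade-off discussed next.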

A parallel program runs most efficiently if the grain size is neither too large nor too small.

What are some disadvantages of having a too-large grain size?

What would be some disadvantages of making the grain size too small?

The highest speedup is achieved with a grain size somewhere in between.

[Figure: Speedup vs. grain size. Speedup falls off at coarse grain because of limited parallelism and load imbalance, and at fine grain because of high overhead; the optimum lies between the two.]

Task overheads depend on the cache organization. A large cache means a lower task overhead. Shared caches tend to reduce the overhead. A task returning to a processor that still has many of its instructions and data in its cache will face a lower overhead than if it started on a different processor.

A program is broken into tasks in one of the following ways:

- Explicit statement of concurrency in the higher-level language, such as in CSP or occam. These languages allow programmers to divide their programs into tasks and specify the communications between them.

- The use of hints in the source code, which the compiler may choose to use or ignore.

- Detection of parallelism by a sophisticated compiler, such as Parafrase. These compilers use techniques like data-dependency graphs to transform serial code into parallel tasks.

Sometimes too much parallelism is found in a program; then tasks are clustered for efficient execution. This may be done when there aren't enough processors to execute all the tasks in parallel. It may also be profitable to group tasks whose working sets intersect heavily.

Multiprocessors vs. networks

In a multiprocessor, the different processors (nodes) are under the control of a single operating system. In a network (e.g., a local-area network), each node is autonomous and can run without the assistance of a central operating system (though it may not have access to all the peripherals).

In general, a network is physically distributed, while a multiprocessor is centralized. In a multiprocessor, not all processors have to be up in order for the system to run, but not every processor can do useful work by itself (e.g., it may not be attached to any peripherals at all). By contrast, network nodes often communicate by transferring files (although some local-area networks now support paging).

Tightly coupled vs. loosely coupled multiprocessors

The next step up from a network is a multiprocessor that has no shared memory. Such a multiprocessor is called loosely coupled; communication is by means of inter-processor messages. A loosely coupled multiprocessor is often called a message-passing or distributed-memory multiprocessor. What is Gustafson's term for this?

A multiprocessor is called tightly coupled if it has a shared memory. The communication bandwidth is on the order of the memory bandwidth. A tightly coupled multiprocessor is often called a shared-memory multiprocessor. What is Gustafson's term for this?

Tightly coupled processors support a finer grain size than loosely coupled processors.

Tightly coupled multiprocessors share memory in several different ways. Among these are:

- Shared data cache/shared memory.
- Separate data caches but a shared bus to memory.
- Separate data caches with separate buses to shared memory.
- Separate processors and separate memory modules interconnected via a multistage switch.
- A separate multilevel cache instead of a shared memory.

There is a middle ground between tightly coupled and loosely coupled multiprocessors: machines with a private memory for each processor, where processors are capable of accessing each other's memories with a certain time penalty.

Sometimes these machines are called non-uniform memory access (NUMA) machines. The BBN Butterfly is an early example. A more recent example is the Silicon Graphics Origin architecture.

In any tightly coupled multiprocessor, the processors are all connected to the memory and peripherals via some sort of interconnection structure, or interconnect:

[Figure: Processors and I/O devices attached through an interconnect to memory modules and memory controllers.]

This is sometimes implemented by providing a "back door" that allows a processor to bypass the interconnection structure for accesses to its private memory. It can be done for any multistage interconnection network, as well as for the crossbar shown at the right.

[Figure: A crossbar switch connecting processors to memories.]

Another approach is to build a system in clusters.

[Figure: A clustered system. Each cluster contains several processors with caches, connected by a local switch to a directory and memory; a global switch interconnection joins the clusters.]

Communication architectures

Each type of parallel architecture has developed its own programming model.

[Figure: Application software and system software sitting above a range of architectures: systolic arrays, dataflow, shared memory, SIMD, and message passing.]

It is useful to visualize a parallel architecture in terms of layers:

[Figure: The layers of a parallel architecture. Parallel applications (CAD, multiprogramming, databases, scientific modeling) run on programming models (shared address, message passing, data parallel); beneath these lie compilation or library support, the communication abstraction (the user/system boundary), operating-system support, communication hardware, and the physical communication medium (the hardware/software boundary).]

Three parallel programming models stand out.

- Shared-address programming is like using a bulletin board, where you can communicate with colleagues.

- Message-passing is like communicating via e-mail or telephone calls. There is a well-defined event when a message is sent or received.

- Data-parallel programming is a regimented form of cooperation. Many processors perform an action separately on different sets of data, then exchange information globally before continuing en masse.

User-level communication primitives are provided to realize the programming model. There is a mapping between the language primitives of the programming model and these communication primitives, which are supported directly by hardware, via the OS, or via user software.

In the early days, the kind of programming model that could be used was closely tied to the architecture. Today, compilers and software play important roles as bridges, and technology trends exert a strong influence. The result is convergence in organizational structure, and relatively simple, general-purpose communication primitives.

A shared address space

In the shared address-space model, processes can access the same memory locations. Communication occurs implicitly as a result of loads and stores (a code sketch appears below, after the address-space figure). This model has several attractions:

- It is convenient.
- A wide range of granularities is supported.

- The programming model is similar to time-sharing on uniprocessors, except that processes run on different processors.
- It gives good throughput on multiprogrammed workloads.
- It is naturally provided on a wide range of platforms; the history dates at least to precursors of mainframes in the early '60s.
- It spans a wide range of scale: a few to hundreds of processors.

This is popularly known as the shared memory model. But this term is ambiguous. Why?

The shared address-space model

A process is a virtual address space plus one or more threads of control. Portions of the address spaces of tasks are shared.

[Figure: Virtual address spaces for a collection of processes communicating via shared addresses. Each process (P0 ... Pn) has a private portion of its address space and a shared portion mapped to common physical addresses in the machine's physical address space; stores and loads to the shared portion carry the communication.]
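Here is the sketch promised above: a minimal producer/consumer exchange through a shared address space, written in C11. It is not from the original notes; the flag-and-payload protocol is an illustrative assumption. The data itself moves with a conventional store and load, while an atomic flag provides the synchronization.

    #include <stdatomic.h>
    #include <pthread.h>
    #include <stdio.h>

    static int payload;              /* communicated by ordinary load/store */
    static atomic_int ready = 0;     /* synchronization uses an atomic flag */

    static void *producer(void *arg)
    {
        payload = 42;                /* conventional store communicates data */
        atomic_store_explicit(&ready, 1, memory_order_release);
        return NULL;
    }

    static void *consumer(void *arg)
    {
        while (!atomic_load_explicit(&ready, memory_order_acquire))
            ;                        /* spin until the producer signals */
        printf("received %d\n", payload);  /* conventional load sees the data */
        return NULL;
    }

    int main(void)
    {
        pthread_t p, c;
        pthread_create(&p, NULL, producer, NULL);
        pthread_create(&c, NULL, consumer, NULL);
        pthread_join(p, NULL);
        pthread_join(c, NULL);
        return 0;
    }

The release/acquire pairing on the flag is the memory-coherence requirement from earlier in a modern guise: the consumer must not observe the flag as set before it can observe the data the flag guards.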

What does the private region of the virtual address space usually contain?

Conventional memory operations can be used for communication. Special atomic operations are used for synchronization.

The interconnection structure

The interconnect in a shared-memory multiprocessor can take several forms.

It may be a crossbar switch. Each processor has a direct connection to each memory and I/O controller, so bandwidth scales with the number of processors. Unfortunately, cost scales with the number of cross-points, i.e., with the product of the number of processors and memories. CS&G call this the mainframe approach.

[Figure: The mainframe approach: processors and I/O controllers connected to memory modules through a crossbar.]

At the other end of the spectrum is a shared-bus architecture. All processors, memories, and I/O controllers are connected to the bus. Such a multiprocessor is called a symmetric multiprocessor (SMP).

[Figure: A symmetric multiprocessor: processors with caches, memories, and I/O controllers sharing a single bus.]

- Latency is larger than for a uniprocessor.

- Incremental cost is low.
- The bus is a bandwidth bottleneck.

A compromise between these two organizations is a multistage interconnection network. The processors are on one side, and the memories and controllers are on the other. Each memory reference has to traverse the stages of the network.

[Figure: An 8-way multistage interconnection network; processors 0-7 reach memories 0-7 through switches in Stage 0, Stage 1, and Stage 2.]

Why is this called a compromise between the other two strategies?

For small configurations, however, a shared bus is quite viable.

Current shared-address architectures

Example: the Intel Pentium Pro four-processor "quad pack."

[Figure: A Pentium Pro quad pack. Four P-Pro modules, each with a CPU, interrupt controller, bus interface, and 256-KB L2 cache, share the P-Pro bus (64-bit data, 36-bit address, 66 MHz); a memory controller and MIU connect the bus to 1-, 2-, or 4-way interleaved DRAM, and two PCI bridges connect it to PCI buses and PCI cards.]

Each of the four processors contains

- a Pentium Pro processor,
- a translation lookaside buffer,
- first-level caches,
- a 256-KB second-level cache,
- an interrupt controller, and
- a bus interface.

All of these are on a single chip connecting directly to a 64-bit memory bus.

Example: the Sun Enterprise. This is based on a 256-bit memory bus delivering 2.5 GB/s of memory bandwidth. There may be up to 16 cards, where each card is either a CPU/memory card or an I/O card.

A CPU/memory card contains two UltraSparc processor modules, each with 16-KB level-1 and 512-KB level-2 caches, two 512-bit-wide memory banks, and an internal switch.

[Figure: A Sun Enterprise system. CPU/memory cards (two UltraSparc processors with caches, a memory controller, two memory banks, and a bus interface/switch) and I/O cards (bus interface, SBUS slots, 100-bT Ethernet, SCSI, and two FiberChannel interfaces) share the Gigaplane bus (256-bit data, 41-bit address, 83 MHz).]

An I/O card contains three SBUS slots for extensions, a SCSI connector, a 100-bT Ethernet port, and two FiberChannel interfaces.

Although memory is packaged with the processors, it is always accessed over the common bus, making this a uniform memory-access multiprocessor. A full configuration would be 24 processors and 6 I/O cards.

The Cray T3E: for larger numbers of processors, a NUMA design is essential. The T3E is designed to scale up to 1024 processors, with 480-MB/s communication links. Processors are DEC Alphas. The topology is a 3-dimensional cube, with each node connected to its six neighbors.

[Figure: A Cray T3E node: a processor with cache, local memory, a memory controller with network interface, external I/O, and a switch linking the node to its neighbors in the X, Y, and Z dimensions.]

- Any processor can access any memory location.
- A short series of instructions is used to establish addressability to remote memory, which can then be accessed via conventional loads and stores.
- Remote data is not cached: there is no way to keep it coherent.
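The T3E's style of remote access survives in SHMEM-style libraries (SHMEM itself originated on the Cray T3D/T3E). Below is a minimal sketch in OpenSHMEM-flavored C; it is an illustrative assumption, not actual T3E code, but it shows the same model: symmetric variables are remotely addressable, data moves with explicit one-sided puts rather than coherent cached loads, and a barrier, not hardware coherence, makes remote updates visible.

    #include <shmem.h>
    #include <stdio.h>

    /* Symmetric variables: allocated at the same address on every PE
       (processing element), so a remote PE can name them. */
    static long src;
    static long dest;

    int main(void)
    {
        shmem_init();
        int me   = shmem_my_pe();
        int npes = shmem_n_pes();

        src = me;

        /* One-sided remote store: write our rank into 'dest' on the
           next PE in a ring. No receive is posted on the other side. */
        shmem_long_put(&dest, &src, 1, (me + 1) % npes);

        /* As on the T3E, remote data is not kept coherent by hardware;
           the barrier completes the puts and makes them visible. */
        shmem_barrier_all();

        printf("PE %d received %ld\n", me, dest);

        shmem_finalize();
        return 0;
    }

With a typical OpenSHMEM implementation this would be launched with something like "oshrun -np 4 ./ring" (launcher names vary by implementation).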