Multiprocessing and Scalability. A.R. Hurson Computer Science and Engineering The Pennsylvania State University

A.R. Hurson Computer Science and Engineering The Pennsylvania State University 1

Large-scale multiprocessor systems have long held the promise of substantially higher performance than traditional uniprocessor systems. However, this potential has been difficult to realize, because of the following: 2

Advances in technology: the rate of increase in uni-processor performance. Complexity of multiprocessor system design: this drastically affected the cost and the implementation cycle. Programmability of multiprocessor systems: the design complexity of parallel algorithms and parallel programs. 3

Programming a parallel machine is more difficult than programming a sequential one. In addition, it takes much more effort to port an existing sequential program to a parallel machine than to a newly developed sequential machine. The lack of good parallel programming environments and standard parallel languages has further aggravated this issue. 4

As a result, the absolute performance of many early concurrent machines was not significantly better than available or soon-to-be available uni-processors. 5

Recently, there has been increased interest in large-scale or massively parallel processing systems. This interest stems from many factors, including: advances in integrated technology; the very high aggregate performance of these machines; the cost/performance ratio; the widespread use of small-scale multiprocessors; and the advent of high-performance microprocessors. 6

Integrated technology. Advances in integrated technology are slowing down. Chip density now allows all the performance features found in a complex processor to be implemented on a single chip, and adding more functionality has diminishing returns. 7

Integrated technology. Studies of more advanced superscalar processors indicate that they may not offer a performance improvement of more than a factor of 2 to 4 for general applications. 8

Aggregate performance of machines. The Cray X-MP (1983) had a peak performance of 235 MFLOPS. A node in an Intel iPSC/2 (1987) with attached vector units had a peak performance of 10 MFLOPS. As a result, a 256-processor configuration of the Intel iPSC/2 would offer less than 11 times the peak performance of a single Cray X-MP (256 x 10 MFLOPS = 2,560 MFLOPS versus 235 MFLOPS). 9

Aggregate performance of machines. Today a 256-processor system might offer on the order of 100 times the peak performance of the fastest systems. The Cray C90 has a peak performance of 952 MFLOPS, while a 256-processor configuration using the MIPS R8000 would have a peak performance of 76,800 MFLOPS (roughly 80 times higher). 10

Economy. A Cray C90 or Convex C4, which exploit advanced IC and packaging technology, costs about $1 million. A high-end microprocessor costs somewhere between $2,000 and $5,000. By using a small number of microprocessors, equal performance can be achieved at a fraction of the cost of the more advanced supercomputers. 11

Small-scale multiprocessors. Due to advances in technology, multiprocessing has come to desktops and high-end personal computers. This has led to improvements in parallel processing hardware and software. 12

In spite of these developments, many challenges need to be overcome in order to exploit the potential of large-scale multiprocessors: scalability and ease of programming. 13

Scalability: maintaining the cost-performance of a uni-processor while linearly increasing overall performance as processors are added. Ease of programming: programming a parallel system is inherently more difficult than programming a uni-processor, where there is a single thread of control. Providing a single shared address space to the programmer is one way to alleviate this problem: the shared-memory architecture. 14

Shared-memory eases the burden of programming a parallel machine, since there is no need to distribute data or explicitly communicate data between the processors in software. Shared-memory is also the model used on small-scale multiprocessors, so it is easier to port programs that have been developed for a small system to a larger shared-memory system. 15

A taxonomy of MIMD organizations. Two parameters affect the organization of an MIMD system: the way processors communicate with each other, and the organization of the global memory. 16

A taxonomy of MIMD organization. Traditionally, based on the way processors communicate with each other, multiprocessor systems can be classified into: message passing systems (also called distributed memory systems), e.g., Intel iPSC, Intel Paragon, nCUBE 2, IBM SP1 and SP2; and shared-memory systems, e.g., IBM RP3, Cray X-MP, Cray Y-MP, Cray C90, Sequent Symmetry. 17

In message passing multiprocessor systems (distributed memory), processors communicate with each other by sending explicit messages. In shared-memory multiprocessor systems the processors are more tightly coupled. Memory is accessible to all processors, and communication among processors is through shared variables or messages deposited in shared memory buffers. 18

Message passing System (programmer's view) [Figure: each processor paired with its own memory; processors communicate via explicit messages] 19

Message passing System (programmer's view) In a message passing machine, the user must explicitly communicate all information passed between processors. Unless the communication patterns are very regular, the management of this data movement is very difficult. Note that multicomputers are equivalent to this organization; it is also referred to as a No Remote Memory Access (NORMA) machine. 20

Shared-memory System (programmer's view) [Figure: processors attached to a common main memory; processors communicate indirectly via shared variables stored in main memory] 21

Shared-memory System (programmer's view) In a shared-memory machine, processors can distinguish communication destination, type, and value through shared-memory addresses. There is no requirement for the programmer to manage the movement of data. 22

Message passing multiprocessor [Figure: nodes, each consisting of a processor, network interface, and memory, connected by an interconnection network] 23

Shared-Memory Multiprocessor [Figure: processors connected through an interconnection network to a global shared memory] 24

A shared-memory architecture allows all memory locations to be accessed by every processor. This helps to ease the burden of programming a parallel system, since: there is no need to distribute data or explicitly communicate data between processors in the system; and shared-memory is the model adopted on small-scale multiprocessors, so it is easier to map programs that have been parallelized for small-scale systems to a larger shared-memory system. 25

Inter-processor communication is also critical in comparing the performance of message passing and shared-memory machines. Communication in a message passing environment is direct and is initiated by the producer of the data. In a shared-memory system, communication is indirect, and the producer usually moves data no further than memory; the consumer then has to fetch the data from memory, which decreases performance. 26

In a shared-memory system, communication requires no intervention on the part of a runtime library or operating system. In a message passing environment, access to the network port is typically managed by the system software. This overhead at the sending and receiving processors makes the start-up cost of communication much higher (usually tens to hundreds of microseconds). 27

As a result of the high communication cost in a message passing organization, either performance is compromised or a coarser grain of parallelism, and hence more limitation on the exploitation of available parallelism, must be adopted. Note that the start-up overhead of communication in a shared-memory organization is typically on the order of microseconds. 28

Communication in shared-memory systems is usually demand-driven by the consuming processor. The problem here is overlapping communication with computation. This is not a problem for small data items; however, it can degrade performance if there is frequent communication or a large amount of data is exchanged. 29

Consider the following case in a message passing environment: a producer process wants to send 10 words to a consumer process. In a typical message passing environment with a blocking send and receive protocol, this problem would be coded as:
Producer process:
  Send (proc i, process j, @sbuffer, num-bytes);
Consumer process:
  Receive (@rbuffer, max-bytes); 30
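As a concrete illustration, here is a minimal C sketch of the same exchange using MPI's blocking point-to-point calls. MPI is not the system named in the slides, just a familiar stand-in; the rank numbers, message tag, and use of integer words are my own assumptions.

```c
/* Hypothetical sketch: producer (rank 0) sends 10 words to consumer (rank 1)
   with blocking calls. Build with mpicc, run with mpirun -np 2. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, data[10], i;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {                      /* producer process */
        for (i = 0; i < 10; i++) data[i] = i;
        /* Blocking send: returns once the send buffer may be reused. */
        MPI_Send(data, 10, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {               /* consumer process */
        /* Blocking receive: returns once the message has arrived in data[]. */
        MPI_Recv(data, 10, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        for (i = 0; i < 10; i++) printf("%d ", data[i]);
        printf("\n");
    }
    MPI_Finalize();
    return 0;
}
```

Even in this short sketch, every transferred word passes through the system software layers described next, which is where the start-up cost comes from.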

This code is usually broken into the following steps: the operating system checks protections and then programs the network DMA controller to move the message from the sender's buffer to the network interface. A DMA channel on the consumer processor has been programmed to move all messages to a common system buffer. When the message arrives, it is moved from the network interface to the system buffer, and an interrupt is posted to the processor. 31

The receiving processor services the interrupt and determines which process the message is intended for. It then copies the message to the specified receive buffer and reschedules the program on the processor's ready queue. The process is dispatched on the processor and reads the message from the user's receive buffer. 32

On a shared-memory system there is no operating system involved, and the processes can transfer data using a shared data area. Assuming the data is protected by a flag indicating its availability and its size, we have:
Producer process:
  For i := 0 to num-bytes - 1 do buffer[i] := source[i];
  Flag := num-bytes;
Consumer process:
  While (Flag == 0) do nothing;
  For i := 0 to Flag - 1 do dest[i] := buffer[i]; 33
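The same idea can be written as a rough C sketch: two threads share a buffer and a flag, with the flag doubling as the byte count. The use of pthreads and C11 atomics here is my own assumption; the slides only assume a shared data area protected by an availability flag.

```c
/* Hypothetical sketch of flag-protected data transfer through shared memory,
   mirroring the producer/consumer pseudocode above. Build with -pthread. */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>
#include <string.h>

#define NUM_BYTES 10

static char buffer[NUM_BYTES];          /* shared data area               */
static char dest[NUM_BYTES];
static atomic_int flag = 0;             /* 0 = empty, otherwise byte count */

static void *producer(void *arg) {
    for (int i = 0; i < NUM_BYTES; i++)
        buffer[i] = (char)('A' + i);    /* data moves no further than memory */
    atomic_store(&flag, NUM_BYTES);     /* publish availability and size     */
    return NULL;
}

static void *consumer(void *arg) {
    int n;
    while ((n = atomic_load(&flag)) == 0)
        ;                               /* busy-wait until data is available  */
    memcpy(dest, buffer, n);            /* consumer fetches data from memory  */
    printf("received %d bytes\n", n);
    return NULL;
}

int main(void) {
    pthread_t p, c;
    pthread_create(&p, NULL, producer, NULL);
    pthread_create(&c, NULL, consumer, NULL);
    pthread_join(p, NULL);
    pthread_join(c, NULL);
    return 0;
}
```

Note that no operating system call appears on the transfer path: the cost is ordinary stores by the producer and loads by the consumer.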

For the message passing environment, the dominant factors are the operating system overhead, programming the DMA, and the internal processing. For the shared-memory environment, the overhead is primarily on the consumer reading the data, since it is then that the data moves from the global memory to the consumer processor. 34

Thus, for a short message the shared-memory system is much more efficient. For long messages, the message-passing environment has similar or possibly higher performance. 35

Message passing vs. Shared-memory systems. Message passing systems minimize the hardware overhead. The single-thread performance of a message passing system is as high as that of a uni-processor system, since memory can be tightly coupled to a single processor. Shared-memory systems are easier to program. 36

Message passing vs. Shared-memory systems Overall, the shared-memory paradigm is preferred since it is easier to use and it is more flexible. A shared-memory organization can emulate a message passing environment while the reverse is not possible without a significant performance degradation. 37

Message passing vs. Shared-memory systems. Shared-memory systems offer lower performance than message passing systems if communication is frequent and not overlapped with computation. In a shared-memory environment, the interconnection network between processors and memory usually requires higher bandwidth and more sophistication than the network in a message passing environment. This can increase overhead costs to the point that the system does not scale well. 38

Message passing vs. Shared-memory systems Solving the latency problem through memory caching and cache coherence is the key to a shared-memory multiprocessor. 39

A taxonomy of MIMD organization. In an MIMD organization, the memory can be centralized or distributed. Note that it is natural to assume that the memory in a message passing system is distributed. 40

A taxonomy of MIMD organization. As a result, based on the memory organization, shared memory multiprocessor systems are classified as Uniform Memory Access (UMA) systems and Non-Uniform Memory Access (NUMA) systems. Note that most large-scale shared memory machines utilize a NUMA structure. 41

Distributed Shared-Memory Multiprocessor Architecture (NUMA) [Figure: nodes, each consisting of a processor, network interface, and local memory, connected by an interconnection network] 42

Distributed Shared-Memory Multiprocessor Architecture (NUMA) NUMA is a shared memory system in which access time varies with the location of the memory word. 43

Cache Only Memory Architecture (COMA) This is a special class of NUMA in which the distributed global main memories are caches. 44

Cache Only Memory Architecture (COMA) [Figure: processors P0 ... Pn-1, each with a cache and an attraction memory, connected by an interconnection network] 45

Uniform Memory Access Architecture [Figure: processors connected through an interconnection network to a set of shared memory modules] 46

Uniform Memory Access Architecture The physical memory is uniformly shared by all the processors: all processors have equal access time to all memory words. Each processor may use a private cache. UMA is equivalent to the tightly coupled multiprocessor class. When all processors also have equal access to peripheral devices, the system is called a symmetric multiprocessor. 47

Symmetric Multiprocessor [Figure: processors connected through an interconnection network to shared memory modules and I/O] 48

Shared Memory Multiprocessor To conclude, a practical way to achieve concurrency is through a collection of a small to moderate number of processors that provides a global physical address space and symmetric access to all of main memory and the peripheral devices: the Symmetric Multiprocessor (SMP). 49

Symmetric Multiprocessor Each processor has its own cache, and all the processors and memory modules are attached to the same interconnect, usually a shared bus. The ability to access all shared data efficiently and uniformly using ordinary load and store operations, together with the automatic movement and replication of data in the local caches, makes this organization attractive for concurrent processing. 50

Shared Memory Multiprocessor Since all communication and local computations generate memory accesses in a shared address space, from a system architect's perspective, the key high-level design issue lies in the organization of the memory hierarchy. In general, there are the following choices: 51

Shared Memory Multiprocessor Shared cache [Figure: processors P0 ... Pn-1 connected by an interconnection switch/network to an interleaved first-level cache backed by interleaved main memory] 52

Shared Memory Multiprocessor Shared cache This platform is most suitable for small-scale configurations of 2 to 8 processors. It is a more common approach for multiprocessors on a chip. 53

Shared Memory Multiprocessor Bus based Shared Memory [Figure: processors P0 ... Pn-1, each with a private cache, sharing a bus to memory and I/O] 54

Shared Memory Multiprocessor Bus based Shared Memory This platform is suitable for small to medium configurations of 20 to 30 processors. Due to its simplicity it is very popular. The bandwidth of the shared bus is the major bottleneck of the system. This configuration is also known as the cache-coherent shared memory configuration. 55

Shared Memory Multiprocessor Dance Hall [Figure: processors P0 ... Pn-1, each with a private cache, on one side of an interconnection network, with memory modules on the other side] 56

Shared Memory Multiprocessor Dance Hall This platform is symmetric and scalable. All memory modules are far away from the processors; in large configurations, several hops are needed to access memory. 57

Shared Memory Multiprocessor Distributed Memory [Figure: processors P0 ... Pn-1, each with a private cache and a local memory, connected by an interconnection network] 58

Shared Memory Multiprocessor Distributed Memory This configuration is not symmetric; as a result, locality should be exploited to improve performance. 59

Distributed Shared-memory architecture As noted, it is possible to build a shared-memory machine with a distributed-memory organization. Such a machine has the same structure as a message passing system, but instead of sending messages to other processors, every processor can directly address both its local memory and the remote memories of other processors. This model is referred to as the distributed shared-memory (DSM) organization. 60

Shared Memory Multiprocessor The distributed shared-memory (DSM) organization is also called the non-uniform memory access (NUMA) architecture. This is in contrast to the uniform memory access (UMA) structure shown before. UMA is easier to program than NUMA. Most small-scale shared-memory systems are based on the UMA philosophy, while large-scale shared-memory systems are based on NUMA. 61

Shared Memory Multiprocessor In most shared-memory organizations, processors and memory are separated by an interconnection network. For small-scale shared-memory systems, the interconnection network is a simple bus. For large-scale shared-memory systems, similar to the message passing systems, we use multistage networks. 62

Shared Memory Multiprocessor In all shared memory multiprocessor organizations, caches are very attractive for reducing memory latency and the required bandwidth. In all design cases except the shared cache, each processor has at least one level of private cache. This raises the issue of cache coherence as an attractive research direction. 63
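As a toy illustration of the sharing pattern that makes coherence necessary (my own example, not from the slides), the C sketch below has two threads repeatedly updating the same word. Each core will typically hold that word in its private cache; the hardware coherence protocol is what keeps the cached copies consistent, while the atomic read-modify-write merely keeps the program itself race-free.

```c
/* Hypothetical sketch: a single shared counter cached by multiple cores. */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

static atomic_long counter = 0;          /* one word, cached per core */

static void *worker(void *arg) {
    for (int i = 0; i < 1000000; i++)
        atomic_fetch_add(&counter, 1);   /* each update must become visible
                                            to the other core's cache     */
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, NULL);
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    /* With coherent caches (and atomic increments) the result is 2,000,000;
       if each core could keep updating a stale private copy, it would not be. */
    printf("%ld\n", atomic_load(&counter));
    return 0;
}
```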