Lecture 9: MIMD Architecture


Lecture 9: MIMD Architecture

- Introduction and classification
- Symmetric multiprocessors
- NUMA architecture
- Cluster machines

Zebo Peng, IDA, LiTH

Introduction

MIMD: a set of general-purpose processors is connected together. In contrast to a SIMD computer, a MIMD computer can execute different programs on different processors. Flexibility!

[Figure: an irregular network of processors (P1, P2, ...) and memories (m1, m2, ...), illustrating irregular task parallelism, e.g., an intelligent cruise controller.]

Introduction (Cont'd)

- MIMD processors work asynchronously and do not have to synchronize with each other. Local clocks can run at higher speeds.
- At any time, different processors may be executing different instructions on different streams of data.
- They can be built from commodity (off-the-shelf) microprocessors with relatively little effort.
- They are also highly scalable, provided that an appropriate memory organization is used.
- Most current parallel computers are built on the MIMD architecture. Supercomputers have a global MIMD organization, even though locally they use SIMD for vector processing.

SIMD vs MIMD

- A SIMD computer requires less hardware than a MIMD computer (a single control unit). However, SIMD processors are specially designed, and tend to be expensive and to have long design cycles.
- Not all applications are naturally suited to SIMD processors.
- Conceptually, MIMD computers also cover SIMD needs, by letting all processors execute the same program (single program, multiple data streams - SPMD).
- On the other hand, SIMD architectures are more power-efficient, since one instruction controls many computations. For regular vector computation, it is still more appropriate to use a SIMD machine (e.g., as implemented in GPUs).
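The SPMD idea can be made concrete with a small sketch. Python threads stand in for the processors of a MIMD machine here; the worker function, data layout, and thread count are illustrative assumptions, not part of the lecture:

```python
import threading

# SPMD sketch: every "processor" runs the SAME program (worker) on a
# DIFFERENT slice of the data -- single program, multiple data streams.

def spmd_sum(data, nworkers=4):
    partial = [0] * nworkers          # one partial result per processor

    def worker(rank):                 # the single program, parameterized by rank
        chunk = data[rank::nworkers]  # each rank picks its own data stream
        partial[rank] = sum(chunk)

    threads = [threading.Thread(target=worker, args=(r,)) for r in range(nworkers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()                      # workers run asynchronously; join at the end
    return sum(partial)

print(spmd_sum(list(range(100))))     # 4950
```

Note that the processors never coordinate during the computation; they only combine their partial results at the end, which is what makes SPMD a natural fit for MIMD hardware.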

MIMD Processor Classification

- Centralized memory: shared memory located at a centralized location, usually consisting of several interleaved modules that are the same distance from any processor.
  - Symmetric Multiprocessor (SMP)
  - Uniform Memory Access (UMA)
- Distributed memory: memory is distributed to each processor, improving scalability.
  - Message-passing architectures: no processor can directly access another processor's memory.
  - Hardware distributed shared memory (DSM): memory is distributed, but the address space is shared.
    - Non-Uniform Memory Access (NUMA)
  - Software DSM: a level of the OS built on top of a message-passing multiprocessor to give a shared-memory view to the programmer.

MIMD with Shared Memory

[Figure: control units CU1..CUn send instruction streams (IS) to processing elements PE1..PEn, whose data streams (DS) all pass through one shared memory - the bottleneck.]

Tightly coupled, not very scalable. Typically called multi-processor systems.

MIMD with Distributed Memory

[Figure: control units CU1..CUn and processing elements PE1..PEn, each PE with its own local memory LM1..LMn, connected by an interconnection network.]

Loosely coupled with message passing, scalable. Typically called multi-computer systems.

Shared-Address-Space Platforms

Part (or all) of the memory is accessible to all processors. Processors interact by modifying data objects stored in this shared address space.

[Figure: processors P connected through an interconnection network to memory modules M.]

- UMA (uniform memory access): the time taken by a processor to access any memory word is identical.
- NUMA, otherwise.
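A minimal sketch of the shared-address-space model, assuming Python threads as the "processors" and a lock standing in for the hardware arbitration that serializes conflicting writes:

```python
import threading

# Shared-address-space sketch: all "processors" (threads) see the same
# memory and interact by modifying a shared data object.

counter = {"value": 0}                # the shared data object
lock = threading.Lock()

def processor(n_increments):
    for _ in range(n_increments):
        with lock:                    # only one write to the location at a time
            counter["value"] += 1

threads = [threading.Thread(target=processor, args=(1000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter["value"])               # 4000
```

Without the lock, the read-modify-write on the shared counter could interleave and lose updates - a software-visible form of the coherence problems discussed later for hardware caches.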

Multi-Computer Systems

A multi-computer system is a collection of computers interconnected by a message-passing network, e.g., clusters. Each processor is an autonomous computer:

- It has its own local memory; there is no longer a shared address space.
- The computers communicate with each other through the message-passing network.
- They can be easily built with off-the-shelf microprocessors.

MIMD Design Issues

Design decisions for an MIMD machine are very complex, since they involve both architecture and software issues. The most important issues include:

- Processor design (functionality and capability)
- Physical organization
- Interconnection structure
- Inter-processor communication protocols
- Memory hierarchy
- Cache organization and coherency
- Operating system design
- Parallel programming languages
- Application software techniques
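The message-passing model of a multi-computer can be sketched as follows; threads stand in for the autonomous nodes and a queue stands in for the network, and all names (node, network, the gather pattern) are illustrative assumptions:

```python
import queue
import threading

# Message-passing sketch: nodes share no address space; each node owns its
# local memory and communicates only via messages. Node 0 gathers partial
# sums from the three other nodes.

network = queue.Queue()               # stands in for the interconnect
results = []

def node(rank, local_data):
    local_sum = sum(local_data)       # compute on private, local memory
    if rank != 0:
        network.put(local_sum)        # send the result to node 0
    else:
        total = local_sum
        for _ in range(3):            # receive one message from each peer
            total += network.get()
        results.append(total)

threads = [threading.Thread(target=node, args=(r, list(range(r * 10, r * 10 + 10))))
           for r in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(results[0])                     # 780
```

The key contrast with the shared-memory sketch is that no node ever touches another node's data directly; all interaction is explicit send/receive.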

Lecture 9: MIMD Architecture

- Introduction and classification
- Symmetric multiprocessors
- NUMA architecture
- Cluster machines

Symmetric Multiprocessors (SMP)

- A set of similar processors of comparable capacity.
  - Each processor can perform the same functions (hence symmetric).
- The processors are connected by a bus or another interconnection.
- They share the same memory.
  - A single memory or a set of memory modules.
  - Memory access time is the same for each processor.
- All processors share access to I/O.
  - Either through the same channels or through different channels giving paths to the same devices.
- All processors are controlled by an integrated operating system.

An SMP Example

[Figure: an example SMP organization.]

SMP Advantages

- High performance: if similar work can be done in parallel (e.g., scientific computing).
- Good availability: since all processors can perform the same functions, the failure of a single processor does not stop the system.
- Support for incremental growth: users can enhance performance by adding additional processors.
- Good scalability: vendors can offer a range of products based on different numbers of processors.

SMP Based on a Shared Bus

[Figure: processors, caches, and memory connected by a shared bus, which is critical for performance.]

Shared-Bus SMP Pros and Cons

Advantages:
- Simple and cheap.
- Flexibility: easy to expand the system by attaching more processors.

Disadvantages:
- Performance is limited by the bus bandwidth.
- Each processor should have a local cache:
  - To reduce the number of bus accesses.
  - This leads to problems with cache coherence, which should be solved in hardware (to be discussed in Lecture 10).

Multi-Port Memory SMP

- Direct access to memory modules by several processors.
- Better performance than a bus-based SMP:
  - Each processor has a dedicated path to each memory module.
  - All the ports can be active simultaneously, with the restriction that only one write can be performed to a given memory location at a time.

Multi-Port Memory SMP (Cont'd)

[Figure: processors with caches connected to the ports of multi-port memory modules.]

- Hardware logic is required to resolve conflicts (parallel writes): permanently designated priorities are assigned to each memory port.
- Portions of memory can be configured as private to one or more processors (which also improves security).
- A write-through cache policy is required, to alert the other processors of memory updates.
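The fixed-priority conflict resolution described above can be sketched as a toy arbiter; the request format and the "lower port number wins" rule are assumptions for illustration only:

```python
# Multi-port write arbitration sketch: when several ports try to write the
# same memory location in one cycle, a permanently designated port priority
# decides which write wins.

def arbitrate(requests):
    """requests: list of (port, address, value) issued in the same cycle;
    a lower port number means higher priority. Returns the winning write
    (port, value) for each contested address."""
    winners = {}
    for port, addr, value in sorted(requests):   # order by port priority
        if addr not in winners:                  # highest-priority write wins
            winners[addr] = (port, value)
    return winners

# Ports 2 and 0 both write address 100 in the same cycle; port 0 wins.
print(arbitrate([(2, 100, "b"), (0, 100, "a"), (1, 200, "c")]))
# {100: (0, 'a'), 200: (1, 'c')}
```

In real hardware this decision is combinational logic at the memory ports, not software, but the priority rule is the same.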

Operating System Issues

- The OS should manage the processor resources so that the user perceives them as a single system.
  - It should appear as a single-processor multiprogramming system.
- In both SMP and uniprocessor systems, several processes may be active at one time.
  - The OS is responsible for scheduling their execution and allocating resources.
- A user may build applications with multiple processes without regard to whether a single processor or multiple processors are available.
- The OS also addresses the issues of reliability and fault tolerance: graceful degradation.

IBM S/390 Mainframe SMP

Processor unit (PU):
- CISC microprocessor.
- Frequently used instructions are hard-wired.
- 64-KB unified L1 cache with 1-cycle access time.

IBM S/390 Mainframe SMP (Cont'd)

- L2 cache: 384 KB.
- Bus-switching network adapter (BSN): includes 2 MB of L3 cache.

IBM S/390 Mainframe SMP (Cont'd)

- Memory card: 8 GB per card.

Lecture 9: MIMD Architecture

- Introduction and classification
- Symmetric multiprocessors
- NUMA architecture
- Cluster machines

Memory Access Approaches

Uniform memory access (UMA), as in SMP:
- All processors have access to all parts of memory.
- Access time to all regions of memory is the same.
- Access time for different processors is the same.

Non-uniform memory access (NUMA):
- All processors have access to all parts of memory.
- A processor's access time differs depending on the region of memory.
- Different processors access different regions of memory at different speeds.

Motivation for NUMA

- SMP places a stringent limit on the number of processors.
  - Bus traffic limits the system to about 16-64 processors.
  - A multi-port memory has a limited number of ports (ca. 10).
- In cluster computers, each node has its own memory.
  - Applications do not see a large global memory.
  - Coherence is maintained by software, not hardware (i.e., at slow speed).
- NUMA retains the SMP flavour while making large-scale multiprocessing possible.
  - E.g., the Silicon Graphics Origin NUMA machine has 1024 processors (1 teraflops peak performance).
- NUMA provides a transparent system-wide memory while allowing nodes to have their own bus or internal interconnections.

A Typical NUMA Organization (1)

- Each node is, in general, an SMP.
- Each node has its own main memory.
- Each processor has its own L1 and L2 caches.

A Typical NUMA Organization (2)

- Nodes are connected by some networking facility.
- Each processor sees a single addressable memory space: each memory location has a unique system-wide address.

A Typical NUMA Organization (3)

For a Load X, the memory request order is:
1. L1 cache (local to the processor)
2. L2 cache (local to the processor)
3. Main memory (local to the node)
4. Remote memory (via the interconnection network)

All of this is done automatically and transparently to the processor - but with very different access times!

NUMA Pros and Cons

- Better performance at higher levels of parallelism than SMP.
- However, performance can break down if there is too much access to remote memory, which can be avoided by:
  - Designing better L1 and L2 caches to reduce memory accesses. This requires good temporal and spatial locality in the software.
  - Exploiting spatial locality: if the software has good spatial locality, the data needed by an application will reside on a small number of frequently used pages, which can be initially moved into the local memory.
  - Enhancing virtual memory by including in the OS a page-migration mechanism that moves a VM page to the node that frequently uses it.
- The design of the OS thus becomes an important issue for NUMA machines.
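The memory request order can be modelled with a toy lookup function; the cycle counts below are purely illustrative assumptions, not measurements of any real machine:

```python
# Toy model of the NUMA memory request order: a load checks L1, then L2,
# then the node's local memory, then remote memory, with latency growing
# sharply at each level.

LATENCY = {"L1": 1, "L2": 10, "local": 100, "remote": 1000}  # cycles (assumed)

def load(addr, l1, l2, local_mem, remote_mem):
    """Return (value, cycles) for a load, searching the hierarchy in order."""
    for level, store in (("L1", l1), ("L2", l2),
                         ("local", local_mem), ("remote", remote_mem)):
        if addr in store:
            return store[addr], LATENCY[level]
    raise KeyError(addr)

l1, l2 = {0: "a"}, {1: "b"}
local_mem, remote_mem = {2: "c"}, {3: "d"}
print(load(0, l1, l2, local_mem, remote_mem))   # ('a', 1)
print(load(3, l1, l2, local_mem, remote_mem))   # ('d', 1000)
```

The three-orders-of-magnitude gap between an L1 hit and a remote access is exactly why locality and page migration dominate NUMA performance tuning.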

Lecture 9: MIMD Architecture

- Introduction and classification
- Symmetric multiprocessors
- NUMA architecture
- Cluster machines

Loosely Coupled MIMD Clusters

Cluster: a set of computers, typically connected over a high-bandwidth local area network and used as a multi-computer system.

- A group of interconnected stand-alone computers that work together as a unified resource.
- Each computer is called a node. A node can also be a multiprocessor itself, such as an SMP.
- Message passing is used for communication between nodes; there is no longer a shared memory space.
- NOW (Network of Workstations): usually a homogeneous cluster.
- Grid: (heterogeneous) computers over a wide area network.
- Cloud computing: a large number of computers connected via the Internet.

Cluster Benefits

- Absolute scalability: clusters with a huge number of nodes are possible.
- Incremental scalability: a user can start with a small system and expand it as needed, without having to go through a major upgrade.
- High availability: fault tolerance by nature.
- Superior price/performance: can be built with cheap off-the-shelf nodes, using supercomputing-class off-the-shelf components.

Amazon, Google, Facebook and Hotmail rely on clusters of PCs to provide services used by millions of users every day.

IBM's Blue Gene Supercomputer

- A supercomputer with a massive number of PowerPC processors and a huge memory space.
- 65,536 processors give 360 teraflops of performance (BG/L).
- It led the TOP500 and Green500 (energy-efficiency) lists of supercomputers several times.
- It makes use of message passing for communication between nodes.

Blue Gene/L Architecture

[Figure: the Blue Gene/L system architecture.]

Blue Gene/Q

- 64-bit PowerPC A2 processor, 4-way simultaneously multithreaded, running at 1.6 GHz.
- The processor cores are linked by a crossbar switch to a 32 MB L2 cache.
- 16 processor cores are used for computing, and a 17th core for the OS. An 18th core is used as a redundant spare in case a core is permanently damaged.
- The 18-core chip has a peak performance of 204.8 GFLOPS.
- Blue Gene/Q delivers 20 PFLOPS (the world's fastest supercomputer in June 2012) while drawing ca. 7.9 megawatts of power.
- It was one of the greenest supercomputers at over 2 GFLOPS/watt at that time (the fastest supercomputer now delivers 6.2 GFLOPS/watt).

Parallelizing Computation

How to make a single application execute in parallel on a cluster:

- Parallelizing compiler: the compiler determines which parts can be executed in parallel.
- Parallelized application: applications written from scratch to be parallel. Message passing is used to move data between nodes. Hard to program, but gives the best end results.
- Parametric computing: an application consists of repeated executions of the same algorithm over different sets of data, e.g., simulation of a system under different scenarios. Needs tools to organize, run, and manage the parallel tasks.

Google Applications

- Search engines require a large amount of computation per request.
  - A single query on Google (on average) reads hundreds of megabytes of data and takes tens of billions of CPU cycles.
  - A peak request stream on Google means thousands of queries per second, which requires supercomputer capability to handle.
- The queries can be easily parallelized: different queries can run on different processors, and a single query can also use multiple processors.
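Parametric computing can be sketched as follows, assuming a made-up simulate function and using threads to stand in for cluster nodes; the scenarios and the formula are illustrative only:

```python
import threading

# Parametric computing sketch: the SAME algorithm runs repeatedly over
# DIFFERENT parameter sets, one run per "node". The runs are independent,
# so the workload is trivially parallel.

def simulate(scenario):
    # hypothetical "simulation": any deterministic function of the parameters
    return scenario["load"] * scenario["speed"]

scenarios = [{"load": l, "speed": s} for l in (1, 2) for s in (10, 20)]
results = [None] * len(scenarios)

def run(i):
    results[i] = simulate(scenarios[i])   # no communication between runs

threads = [threading.Thread(target=run, args=(i,)) for i in range(len(scenarios))]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(results)                            # [10, 20, 20, 40]
```

Because the runs never communicate, the only tooling needed is what the slide names: something to organize, launch, and collect the independent tasks.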

Google Cluster of PCs

- A cluster of off-the-shelf PCs: a cheap solution.
  - More than 15,000 nodes form a cluster, rather than a smaller number of high-end servers or a supercomputer.
- Important design considerations:
  - Price/performance ratio.
  - Energy efficiency (a huge amount of energy is needed).
- Several clusters are available worldwide, to handle catastrophic failures (earthquakes, power failures, etc.).
- Parametric computing: the same algorithm (PageRank, to determine the relative importance of a web page) runs over different sets of data.

Summary

- The two most common MIMD architectures are symmetric multiprocessors and clusters.
  - SMP is an example of a tightly coupled system with shared memory.
  - A cluster is a loosely coupled system with distributed memory.
- More recently, non-uniform memory access (NUMA) systems have also become popular, though they are more complex in terms of design and operation.
- Several organization styles can also be integrated into a single system, e.g., an SMP as a node inside a NUMA machine.
- The MIMD architecture is highly scalable; therefore, MIMD machines can be built with a huge number of microprocessors.