SMD149 - Operating Systems - Multiprocessing
Roland Parviainen
December 1, 2005

Overview
Multiprocessor systems: multiprocessor, operating system and memory organizations.

Multiprocessor system
A multiprocessor system is a computer that contains more than one processor. Benefits:
- Increased processing power
- Resource use scales to application requirements
Additional operating system responsibilities:
- Keep all processors busy
- Distribute processes evenly throughout the system
- Ensure all processors work on consistent copies of shared data
- Synchronize the execution of related processes
- Enforce mutual exclusion (a lock sketch appears below)
Examples of multiprocessors:
- Dual-processor personal computer
- Powerful server containing many processors
- Cluster of workstations
Classifications of multiprocessor architecture:
- Nature of the datapath
- Interconnection scheme
- How processors share resources

Flynn's taxonomy
Based on the number of concurrent instruction (or control) streams and data streams available:
- SISD: single instruction, single data stream
- MISD: multiple instruction, single data stream
- SIMD: single instruction, multiple data streams
- MIMD: multiple instruction, multiple data streams

SISD
Single instruction, single data stream: the sequential computer. Exploits minor parallelism through:
- Pipelining
- Superscalar processors
- Hyper-threading
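The lock sketch promised above: a minimal example, assuming POSIX threads, of how an operating system responsibility like mutual exclusion keeps a shared counter consistent when two threads update it.

```c
#include <pthread.h>
#include <stdio.h>

static long counter = 0;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

/* Each thread increments the shared counter; the mutex ensures both
   threads always work on a consistent copy of the shared data. */
static void *worker(void *arg)
{
    (void)arg;
    for (int i = 0; i < 1000000; i++) {
        pthread_mutex_lock(&lock);
        counter++;
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, NULL);
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("counter = %ld\n", counter);  /* always 2000000 with the lock */
    return 0;
}
```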

MISD
Multiple instruction, single data stream. Not commonly used; mainly for redundancy, e.g. three processors perform the same computation and the results are compared.

SIMD
Single instruction, multiple data streams:
- One or more processing units apply the same instruction to a block of data
- Array or vector processors
- Most processors have SIMD instructions: PowerPC's AltiVec; Intel's MMX, SSE, SSE2 and SSE3; AMD's 3DNow!; SPARC's VIS; PA-RISC's MAX; and MIPS MDMX and MIPS-3D
- Widely used for multimedia (an SSE sketch appears below)

MIMD
Multiple instruction, multiple data streams: multiple autonomous processors simultaneously execute different instructions on different data. Examples:
- Multiprocessor systems
- Networks of workstations
- Distributed systems

Processor interconnection schemes
An interconnection scheme describes how the system's components, such as processors and memory modules, are connected. It consists of nodes (components or switches) and links (connections). Parameters used to evaluate interconnection schemes:
- Node degree
- Bisection width
- Network diameter
- Cost of the interconnection scheme

Shared bus
- Single communication path between all nodes
- Contention can build up on the shared bus
- Lacks scalability; useful for interconnecting a small number of processors
- For systems with many processors, a bus can be used to interconnect the nodes within a supernode
- A dynamic network
- Example: dual-processor Intel Pentium
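The SSE sketch promised above: a minimal example, assuming an x86 compiler with SSE support, in which a single instruction operates on a block of four floats at once.

```c
#include <xmmintrin.h>  /* SSE intrinsics */
#include <stdio.h>

int main(void)
{
    float a[4] = {1.0f, 2.0f, 3.0f, 4.0f};
    float b[4] = {10.0f, 20.0f, 30.0f, 40.0f};
    float c[4];

    /* One SIMD add instruction processes all four lanes in parallel. */
    __m128 va = _mm_loadu_ps(a);
    __m128 vb = _mm_loadu_ps(b);
    __m128 vc = _mm_add_ps(va, vb);
    _mm_storeu_ps(c, vc);

    printf("%g %g %g %g\n", c[0], c[1], c[2], c[3]);  /* 11 22 33 44 */
    return 0;
}
```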

Crossbar-switch matrix
- Separate path from every processor to every memory module (or, in general, from every node to every other node)
- For n processors and m memory modules there are n x m switches in total
- High fault tolerance: bisection width = (n x m)/2
- High performance: simultaneous transmissions can be supported, and a network diameter of 1 yields low latency
- High cost: proportional to n x m
- Example: Sun UltraSPARC-III

2-D mesh network
- n rows and m columns, in which each node is connected to the nodes directly north, south, east and west of it
- Maximum node degree = 4
- Moderate performance, fault tolerance and cost
- Not very scalable, since the network diameter may be substantial
- Example: Intel Paragon (a large-scale system, but communication is kept between neighboring nodes)

Hypercube
- An n-dimensional hypercube has 2^n nodes, each connected to n neighbor nodes
- Better scalability and performance: a 4-D hypercube (16 nodes) has a network diameter of 4, while a 2-D mesh network with 16 nodes has a network diameter of 6
- More fault tolerant, but more expensive, than crossbar-switch or 2-D mesh networks
- Example: nCUBE (up to 8192 processors)
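To make the diameter comparison concrete, here is a minimal sketch (the node labels are illustrative) that enumerates a hypercube node's neighbors, which differ from it in exactly one bit, and prints the two diameters quoted above.

```c
#include <stdio.h>

/* In an n-dimensional hypercube, node labels are n-bit numbers and
   two nodes are linked iff their labels differ in exactly one bit. */
static void print_neighbors(unsigned node, int n)
{
    printf("node %u neighbors:", node);
    for (int bit = 0; bit < n; bit++)
        printf(" %u", node ^ (1u << bit));  /* flip one bit per dimension */
    printf("\n");
}

int main(void)
{
    int n = 4;                       /* 4-D hypercube: 2^4 = 16 nodes */
    print_neighbors(5, n);           /* 0101 -> 0100 0111 0001 1101   */

    /* Diameter = longest shortest path between any pair of nodes.   */
    printf("hypercube diameter: %d\n", n);            /* flip n bits: 4   */
    printf("4x4 mesh diameter:  %d\n", (4-1)+(4-1));  /* corner-corner: 6 */
    return 0;
}
```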

Multistage networks
- Switch nodes act as hubs routing messages between nodes
- Cheaper (simple switch hardware that can be packed tightly)
- Less fault tolerant and worse performance than a crossbar-switch matrix: contention can occur at switches, and the network diameter is typically larger
- Example: IBM POWER4

Loosely coupled vs. tightly coupled systems
Tightly coupled systems:
- Processors share most resources, including memory
- Communicate over shared buses
- Less burden for operating system programmers
- Less flexible; poor scalability and fault tolerance
Loosely coupled systems:
- Processors do not share most resources; each has dedicated memory
- Communicate through explicit messages over a communication link, or through shared virtual memory
- More flexible, fault tolerant and scalable

Multiprocessor operating system organizations
Multiprocessor operating systems differ significantly from uniprocessor ones. They are classified into three types, based on how processors share operating system responsibilities:
- Master/slave
- Separate kernels
- Symmetrical organization

Master/slave
- The master processor executes the operating system code and handles computation and I/O jobs; slaves execute only user programs (processor-bound jobs)
- Hardware asymmetry: slave processors must wait for the master processor to handle interrupts for OS operations
- Low fault tolerance: failure of the master processor results in system failure
- Good for computationally intensive jobs
- Example: nCUBE system
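A toy user-level analogy of the master/slave split, assuming POSIX threads (the job queue and names are invented for illustration): the master hands out work and keeps all I/O to itself, while the slaves only execute computation.

```c
#include <pthread.h>
#include <stdio.h>

#define SLAVES 3
#define JOBS   9

static int next_job = 0;                 /* master's job queue index */
static long results[JOBS];
static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;

/* Slaves run only user computation (processor-bound jobs); they do
   no I/O, mirroring the asymmetry described above. */
static void *slave(void *arg)
{
    (void)arg;
    for (;;) {
        pthread_mutex_lock(&m);
        int job = next_job < JOBS ? next_job++ : -1;
        pthread_mutex_unlock(&m);
        if (job < 0) return NULL;
        results[job] = (long)job * job;  /* the "computation" */
    }
}

int main(void)  /* the master: dispatches work and does all the I/O */
{
    pthread_t t[SLAVES];
    for (int i = 0; i < SLAVES; i++)
        pthread_create(&t[i], NULL, slave, NULL);
    for (int i = 0; i < SLAVES; i++)
        pthread_join(t[i], NULL);
    for (int j = 0; j < JOBS; j++)
        printf("job %d -> %ld\n", j, results[j]);
    return 0;
}
```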

Separate kernels
- Each processor executes its own operating system and responds only to interrupts generated by user processes running on that processor
- Loosely coupled; little contention over resources
- Some globally shared operating system data
- Total system failure is unlikely, but failure of one processor terminates the processes running on it
- Useful in environments where processes do not interact much and high availability is important

Symmetrical organization
- The operating system manages a pool of identical processors
- High amount of resource sharing, requiring mutual exclusion
- Allows processes to truly share the processors and exploit parallelism
- Highest degree of fault tolerance of any organization; facilitates graceful degradation
- Some contention for resources can limit the performance improvement gained by adding more processors

UMA
Memory organizations are classified by how memory is shared, with trade-offs between performance, scalability and cost. In a uniform memory access (UMA) multiprocessor system:
- All processors share main memory
- Access time to any memory page is nearly the same for all processors and all memory modules
- A shared bus or crossbar-switch matrix is typically used
- Used in small multiprocessor systems (typically two to eight processors)

NUMA
In a nonuniform memory access (NUMA) multiprocessor:
- Global memory is divided into memory partitions or modules
- Each node contains a few processors and a portion of system memory, which is local to that node
- Access to local memory is faster than access to global memory (the rest of the memory modules)
- More scalable than UMA, with fewer bus collisions
- Most of a processor's memory requests are served by local memory
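The last point is what makes NUMA pay off, and it can be quantified with a simple effective access time model. A minimal sketch; the latencies below are made-up placeholders, not measurements.

```c
#include <stdio.h>

/* Effective memory access time in a NUMA system:
   t_eff = p_local * t_local + (1 - p_local) * t_remote */
int main(void)
{
    double t_local  = 100.0;   /* ns, assumed local access latency  */
    double t_remote = 300.0;   /* ns, assumed remote access latency */

    for (double p = 0.5; p <= 1.0001; p += 0.1) {
        double t_eff = p * t_local + (1.0 - p) * t_remote;
        printf("local hit ratio %.1f -> %.0f ns average\n", p, t_eff);
    }
    return 0;
}
```

The higher the fraction of requests served locally, the closer the average access time gets to the local latency, which is why the page migration and replication strategies below matter.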

COMA
In a cache-only memory architecture (COMA):
- The basic physical interconnection is similar to NUMA
- Main memory is viewed as a cache and is called an attraction memory (AM)
- The system can migrate data to the node that accesses it most often, at the granularity of a memory line (more efficient than a memory page)
- Reduces the number of cache misses serviced remotely
- Overhead: duplicated data items, and a complex protocol to ensure that all updates reach all processors

NORMA
No-remote-memory-access (NORMA) systems:
- Loosely coupled; simple to build, with no complex interconnection
- No shared global memory: each node maintains its own local memory
- Communication is through explicit messages (a message-passing sketch appears below)
- Some implement the illusion of shared global memory, known as shared virtual memory (SVM)
- Distributed systems involve message passing via remote procedure calls (RPCs)
- Example: Google server farms

Memory sharing
- Memory coherence: ensure that all processors have access to the most up-to-date copy of data (cache coherency)
- Strategies to increase the percentage of local memory hits: page migration and page replication
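The NORMA sketch promised above: a minimal POSIX example in which a pipe stands in for the interconnect (illustrative only; real systems use a network), so the two processes share no memory and communicate purely by explicit messages.

```c
#include <stdio.h>
#include <string.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    int link[2];
    char buf[64];

    pipe(link);                            /* stand-in interconnect    */
    if (fork() == 0) {                     /* the "remote node"        */
        read(link[0], buf, sizeof buf);    /* receive explicit message */
        printf("remote node got: %s\n", buf);
        return 0;
    }
    const char *msg = "compute f(x)";      /* RPC-like request         */
    write(link[1], msg, strlen(msg) + 1);  /* send explicit message    */
    wait(NULL);
    return 0;
}
```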

Cache coherence
UMA cache coherence can be implemented in three different ways:
- Bus snooping (a.k.a. cache snooping)
- Maintaining a centralized directory that indicates when to remove stale data
- Allowing only one processor to cache a given data item
Cache-coherent NUMA:
- Each address is associated with a home node
- The home node is contacted on reads and writes to obtain the most up-to-date version of the data
- The protocol requires a maximum of three network traversals per coherency operation

Page replication and migration
- Page replication: the system places a copy of a memory page at a new node; good for pages that are read by processes at different nodes but not written
- Page migration: the system moves a memory page from one node to another; good for pages written to by a single remote process
- Both increase the percentage of local memory hits
- Drawbacks: memory overhead, and replicating or migrating is more expensive than referencing remotely, so it is only useful if the page is referenced again

Scheduling
Scheduling determines the order in which processes are dispatched, and to which processors. Two goals:
- Parallelism: timesharing scheduling
- Processor affinity (soft or hard): space-partitioning scheduling
Types of scheduling algorithms: job-blind and job-aware.

Job-blind multiprocessor scheduling
Straightforward extensions of uniprocessor algorithms. Examples:
- First-in-first-out (FIFO) multiprocessor scheduling
- Round-robin (RR) multiprocessor scheduling
- Shortest-process-first (SPF) multiprocessor scheduling

Job-aware multiprocessor scheduling
Algorithms that maximize parallelism:
- Shortest-number-of-processes-first (SNPF) scheduling
- Round-robin job (RRjob) scheduling
- Coscheduling: processes from the same job are placed in adjacent spots in the global run queue; a sliding window moves down the run queue, running the adjacent processes that fit into the window (window size = number of processors)
Algorithms that maximize processor affinity:
- Dynamic partitioning

Coscheduling example
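A toy sketch of the sliding window (the queue contents are invented, and real coscheduling also handles window alignment and fragmentation, which are ignored here):

```c
#include <stdio.h>

#define NPROCS 4   /* processors = sliding-window size */

/* Global run queue: processes of the same job occupy adjacent slots;
   each entry names the job its process belongs to. */
static int run_queue[] = {1, 1, 1, 2, 2, 3, 3, 3, 3, 4};
static int qlen = sizeof run_queue / sizeof run_queue[0];

int main(void)
{
    /* Each time slice, the window covers NPROCS adjacent processes,
       which run simultaneously; because same-job processes sit next
       to each other, a job's processes tend to run together. */
    for (int slice = 0, start = 0; start + NPROCS <= qlen; slice++, start++) {
        printf("slice %d runs processes of jobs:", slice);
        for (int i = start; i < start + NPROCS; i++)
            printf(" %d", run_queue[i]);
        printf("\n");
    }
    return 0;
}
```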

Dynamic partitioning

Process migration
Process migration transfers a process from one node to another. Benefits:
- Increased fault tolerance
- Load balancing
- Reduced communication costs
- Resource sharing

Flow of process migration
1. Either the sender or the receiver initiates the request
2. The sender suspends the migrating process
3. The sender creates a message queue for the migrating process's messages
4. The sender transmits the state to a dummy process at the receiver
5. The sender and receiver notify the other nodes of the process's new location
6. The sender deletes its instance of the process
(A code sketch of this flow appears a few slides below.)

Process migration concepts
There is a trade-off between residual dependency and initial migration latency. Characteristics of a successful migration strategy:
- Minimal residual dependency
- Transparent
- Scalable
- Fault tolerant
- Heterogeneous

Process migration strategies
- Eager migration: transfer the entire state during the initial migration
- Dirty eager migration: transfer only dirty memory pages; clean pages are brought in from secondary storage
- Copy-on-reference migration: like dirty eager migration, except clean pages can also be acquired from the sending node on demand
- Lazy copying: no initial transfer of pages
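The promised sketch of the sender-side migration flow; every function here is a hypothetical placeholder standing in for real kernel and network machinery, not an actual API.

```c
#include <stdio.h>

struct process { int pid; };

/* Hypothetical placeholders for the steps of the flow above. */
static void suspend(struct process *p)
    { printf("suspend process %d\n", p->pid); }
static void queue_messages(struct process *p)
    { printf("queue incoming messages for %d\n", p->pid); }
static void send_state(struct process *p, int node)
    { printf("send state of %d to dummy process at node %d\n", p->pid, node); }
static void notify_all_nodes(struct process *p, int node)
    { printf("announce: %d now lives at node %d\n", p->pid, node); }
static void delete_local(struct process *p)
    { printf("delete local instance of %d\n", p->pid); }

/* Steps 2-6 of the migration flow (step 1 is the initiating request). */
static void migrate(struct process *p, int receiver)
{
    suspend(p);
    queue_messages(p);
    send_state(p, receiver);
    notify_all_nodes(p, receiver);
    delete_local(p);
}

int main(void)
{
    struct process p = { 42 };
    migrate(&p, 3);
    return 0;
}
```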

Process migration strategies (continued)
- Flushing migration: flush all dirty pages to disk before migration; transfer no memory pages during the initial migration
- Precopy migration: begin transferring memory pages before suspending the process; migrate once the number of dirty pages falls below a threshold

Load balancing
The system attempts to distribute the processing load evenly among processors:
- Reduces variance in response times
- Ensures no processors remain idle while other processors are overloaded
Two types: static load balancing and dynamic load balancing.

Static load balancing
- Useful in environments in which jobs exhibit predictable patterns
- Goals: distribute the load and reduce communication costs
- Can be solved using a graph model
- For large systems, approximations must be used to balance loads with reasonable overhead

Dynamic load balancing
More useful than static load balancing in environments where communication patterns change and processes are created or terminated unpredictably.
Types of policies:
- Sender-initiated
- Receiver-initiated
- Symmetric
- Random
Algorithms: the bidding algorithm and the drafting algorithm.
Communication issues:
- Communication with remote nodes can have long delays, and the load on a processor might change in the meantime
- Coordination between processors to balance the load creates many messages and can overload the system
- Solution: communicate only with neighbor nodes; load diffuses throughout the system (see the sketch below)
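A toy sketch of that diffusion idea (the ring topology and the numbers are invented): each node repeatedly averages its load with its two neighbors only, yet the load spreads through the whole system without any global coordination.

```c
#include <stdio.h>

#define NODES  6
#define ROUNDS 8

int main(void)
{
    /* Initial loads: node 0 is overloaded, the rest are nearly idle. */
    double load[NODES] = {60, 2, 2, 2, 2, 2};
    double next[NODES];

    for (int r = 0; r < ROUNDS; r++) {
        for (int i = 0; i < NODES; i++) {
            int left  = (i + NODES - 1) % NODES;   /* ring topology */
            int right = (i + 1) % NODES;
            /* Each node talks only to its neighbors; local averaging
               lets load diffuse gradually through the system. */
            next[i] = (load[left] + load[i] + load[right]) / 3.0;
        }
        for (int i = 0; i < NODES; i++)
            load[i] = next[i];
    }
    for (int i = 0; i < NODES; i++)
        printf("node %d load %.1f\n", i, load[i]);
    return 0;
}
```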

Summary
Next time: networks and distributed systems.